Document zone content classification and its performance evaluation
Abstract
This paper describes an algorithm for determining the zone content type of a given zone within a document image. We take a statistical approach and represent each zone with a 25-dimensional feature vector. An optimized decision tree classifier is used to classify each zone into one of nine zone content classes. A performance evaluation protocol is proposed. The training and testing datasets include a total of 24,177 zones from the University of Washington English Document Image Database III. The algorithm accuracy is 98.45% with a mean false alarm rate of 0.50%.

Summary
A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for determining the zone type of a given zone within an input document image. In our zone content classification algorithm, zones are represented as 25-dimensional feature vectors. The feature vector includes our new signature-like background analysis structure, which is well suited to studying the statistical characteristics of a given zone. A decision tree classifier is used to classify each zone into one of nine classes on the basis of its feature vector. The protocol by which the decision tree classifier is optimized eliminates the data over-fitting problem. To enrich our probabilistic model, we incorporate context constraints for certain zones within their neighboring zones. We model zone class context constraints as a Hidden Markov Model and use the Viterbi algorithm to obtain optimal classification results. The training, pruning and testing datasets for the algorithm include a total of 1600 images from the University of Washington English Document Image Database III. With a total of 24,177 zones within the data set, a cross-validation method was used in the performance evaluation of the classifier. The classifier is able to classify each given scientific and technical document zone into one of nine classes: two text classes (font sizes 4-18 pt and 19-32 pt), math, table, halftone, map/drawing, ruling, logo, and others. Using our zone content classification performance evaluation protocol, the algorithm accuracy is 98.45% with a mean false alarm rate of 0.50%.
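The abstract does not give the HMM's states, transition probabilities, or how the decision tree output feeds into it, so the following is only a minimal sketch of generic Viterbi decoding over a sequence of neighboring zones. The two class names, all probability values, and the use of per-zone classifier scores as emission terms are assumptions made for illustration and are not taken from the paper.

```python
import math


def viterbi(emissions, transition, initial):
    """Return the most probable class sequence for a list of zones.

    emissions:  one dict per zone, mapping class -> score for that zone
                (assumed here to come from a per-zone classifier)
    transition: transition[prev][cur] = P(cur | prev), hypothetical values
    initial:    initial[cls] = P(cls) for the first zone, hypothetical values
    """
    def log(p):
        # Work in log space; a zero probability becomes -infinity.
        return math.log(p) if p > 0 else float("-inf")

    classes = list(initial)
    # best[c]: best log-score of any path that ends in class c at the current zone
    best = {c: log(initial[c]) + log(emissions[0][c]) for c in classes}
    backpointers = []
    for obs in emissions[1:]:
        new_best, ptr = {}, {}
        for c in classes:
            prev = max(classes, key=lambda p: best[p] + log(transition[p][c]))
            new_best[c] = best[prev] + log(transition[prev][c]) + log(obs[c])
            ptr[c] = prev
        best = new_best
        backpointers.append(ptr)

    # Trace back from the best final class to recover the full path.
    path = [max(classes, key=lambda c: best[c])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy example (all numbers are made up): three neighboring zones, two classes.
initial = {"text": 0.6, "math": 0.4}
transition = {
    "text": {"text": 0.8, "math": 0.2},
    "math": {"text": 0.2, "math": 0.8},
}
emissions = [
    {"text": 0.9, "math": 0.1},
    {"text": 0.4, "math": 0.6},  # ambiguous zone
    {"text": 0.9, "math": 0.1},
]

independent = [max(obs, key=obs.get) for obs in emissions]
decoded = viterbi(emissions, transition, initial)
print(independent)  # ['text', 'math', 'text']
print(decoded)      # ['text', 'text', 'text']
```

In this toy case, classifying each zone independently labels the ambiguous middle zone as math, while decoding the three zones jointly labels it as text because its neighbors are confidently text. This is the kind of effect that context constraints are meant to capture.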
Similar resources
Document Zone Content Classification Using Decision Tree and HMM
A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. This paper describes an algorithm to classify each given document zone into one of nine different classes. Foreground and background features are studied. We used an optimized binary decision tree to estimate the maximum zone content class probability in one set while used Viter...
A Study on the Document Zone Content Classification Problem
A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for the determination of zone type of a given zone within an input doc...
Zone Content Classification and its Performance Evaluation
This paper presents an improved zone content classification method and its performance evaluation. We added two new features to the feature vector from one previously published method [1]. We assumed different independence relationship in two zone sets. We used an optimized binary decision tree to estimate the maximum zone content class probability in one set while used Viterbi algorithm to find ...
A Method for Document Zone Content Classification
This paper describes an algorithm to classify each given document zone into one of nine classes and provides a protocol for its performance evaluation. The classification scheme uses an optimized binary decision tree and Viterbi algorithm for HMM to find the optimal solution. Our algorithm was trained and tested on a total of 24,177 zones within the 1600 images from UWCDROM III database. Its ac...
Page Layout Classification Technique for Biomedical Documents
The structural layout information of scanned document pages is valuable for a wide range of document processing applications such as automatic document searching, document delivery and automated data entry. This paper describes the classification of scanned document pages into different classes of physical layout structures. The page layout classification technique proposed in this paper uses a...
Journal: Pattern Recognition
Volume: 39
Publication date: 2006